Hello everyone. Welcome back to the Heterogeneous Parallel Programming class. This is lecture 1.5, Introduction to CUDA; we are in the memory allocation and data movement API functions part.

The objective of this lecture is to help you learn the basic application programming interface functions, or API functions, in CUDA host code. API functions are a standard way for industry to extend standard programming languages to support certain specialized functionalities. In this case, the CUDA designers at NVIDIA provided these API functions to help C programmers use throughput-oriented devices, such as GPUs, in a heterogeneous computing system. The two types of API functions that we will be looking at today are the device memory allocation functions and the host-device data transfer functions.

This slide reviews the vector addition example. We used this example to illustrate data parallelism, and today we are going to use it to show how easily you can convert a standard sequential C program for vector addition into a heterogeneous parallel piece of code with the same functionality. Just as a reminder, each CUDA thread will be adding one element of A and one element of B and assigning the sum to the corresponding element of C.

Here we look at traditional C code for vector addition. In the main function, we do the memory allocation in C, then we do some I/O to read in A and B, and then we need to determine the number of elements in A and B, which we call N. At some point, we want to do the vector addition, so we call the vecAdd function with four parameters: the pointers to A, B, and C, and the number of elements in these vectors. Up here, we show a simple sequential function for vector addition in C. It matches the parameters that our main function is going to pass to it: three pointers to floating-point arrays A, B, and C, and an integer value that gives the number of elements in A, B, and C. Then there is a for loop with a loop variable i, and this for loop sequentially goes through all the elements of A and B, adds them up, and assigns the sum to the corresponding element of C. This is sequential code, because the loop sequentially visits all the A and B elements and generates the corresponding C elements.
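As a concrete reference, here is a minimal sketch of that sequential version. The vecAdd body follows the lecture's description; the main function is illustrative, with a fixed n and an initialization loop standing in for the I/O reads, which the lecture does not show.

#include <stdlib.h>

/* Sequential vector addition: the loop visits every element of A and B,
   adds them, and assigns the sum to the corresponding element of C. */
void vecAdd(float *A, float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main(void)
{
    int n = 1024;  /* illustrative; in the lecture, n comes from I/O */
    float *h_A = (float *) malloc(n * sizeof(float));
    float *h_B = (float *) malloc(n * sizeof(float));
    float *h_C = (float *) malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {  /* stand-in for reading A and B */
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    vecAdd(h_A, h_B, h_C, n);

    free(h_A); free(h_B); free(h_C);
    return 0;
}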
Now we are going to show how we can systematically convert this piece of code into parallel CUDA code.

This slide shows the outline of how we can change the vecAdd function to use a throughput-oriented GPU device. Instead of performing the actual computation itself, this function is going to call a kernel function that will be executed on the device. Before calling that kernel, the function also needs to do some outsourcing activity: it needs to copy data from the host memory into the device memory, so that the device is ready to process the data. Eventually, after the device completes its computation, it needs to copy the C vector back into the host memory.

When we look at the function, there is also a header file that you will need to include in order for this function to work properly. This is the include file cuda.h. This is a line that you need to add to your CUDA files in order for the host code to get access to all the API functions properly.

Here we show the three main parts of the host code. The first part is to allocate device memory for A, B, and C, and copy A and B to the device memory. This is illustrated in the top picture, where part one copies data from the host memory to the device memory after we have allocated space in the device memory. The second part is for the host code to launch the kernel function; this will be the topic of the next lecture. And after the kernel completes its execution, part three of the host code will copy C, the result of the computation, back from the device memory to the host memory. This is illustrated as part three in the picture.
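In code, the outline might look like the following skeleton, assuming the same parameters as the sequential version (the h_ prefix marks host pointers, a convention the lecture uses shortly). The bodies of the three parts are left as comments for now.

#include <cuda.h>

/* Outline of the revised vecAdd: a skeleton only, following the
   three parts described above. */
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Part 1: allocate device memory for A, B, and C;
       copy A and B from host memory to device memory. */

    /* Part 2: launch the kernel that performs the actual
       vector addition on the device (next lecture). */

    /* Part 3: copy the result C from device memory back to
       host memory; free the device memory. */
}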
In order to really understand what is going on with these API functions, you need a good conceptual understanding of the CUDA memories. This picture is actually a simplified one; it does not show all of the CUDA device memory types. It shows the two important parts that you will need in order to understand the API functions and, immediately in the next lecture, a simple piece of kernel code: the registers and the global memory.

The simple way of looking at this is that each device will have many, many threads. Remember, these threads are actually virtualized von Neumann processors, so you can think of the threads as processors. Each processor has a set of registers, and these registers hold variables that are private to the thread. All the threads also have access to a shared global memory. This is going to be important as well: in the kernel, we will see that some of the accesses go to variables in the shared global memory.

For the purposes of this lecture, the more important part is that the host code can allocate memory in the global memory, and can also request data copies from the host memory to the global memory, and vice versa, that is, from the global memory back to the host memory. We will cover more memory types in subsequent lectures, when we talk about locality and so on.

The first type of API functions that we are going to focus on is the CUDA device memory management API functions, mainly cudaMalloc and cudaFree. The cudaMalloc function allocates objects in the device global memory. It takes two parameters: one is the address of a pointer to the allocated object, and the other is the size of the allocated object in bytes. For C programmers, this is pretty much the same as the malloc function, because malloc in C also requires the size of the allocated object in bytes. The difference that you will notice as a C programmer is that C's malloc returns a pointer value to the allocated object, whereas here we pass the address of a pointer, and the allocation function writes the address of the allocated object into that pointer. So this is really a call-by-reference activity. The reason it is different is that all the CUDA API functions return an error code, which we will see in a few slides. Because the return value is always the error code, the only way the cudaMalloc function can systematically return a pointer to the allocated object is through this call-by-reference convention.

The second function is cudaFree. cudaFree frees the object from the global memory so that the memory space can be recycled, and it takes only one parameter, which is a pointer to the freed object. Be very careful here: the parameter to cudaFree is the pointer to the freed object, whereas the first parameter to cudaMalloc is the address of the pointer to the allocated object.
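As a small sketch of this convention difference (using the d_ prefix for device pointers, as the lecture does shortly), note the & in the cudaMalloc call versus the plain pointer in the cudaFree call:

#include <cuda.h>

/* A sketch contrasting the two parameter conventions described above. */
void allocFreeExample(int n)
{
    float *d_A;
    int size = n * sizeof(float);

    /* address of the pointer: cudaMalloc writes the device address into d_A */
    cudaMalloc((void **) &d_A, size);

    /* ... use d_A in kernels ... */

    /* the pointer itself, not its address */
    cudaFree(d_A);
}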
The second category of API functions that we will be using today is the host-device data transfer API functions, mainly the cudaMemcpy function. cudaMemcpy is fashioned after the C memcpy function. It performs memory data transfer, and it requires four parameters: the first parameter is a pointer to the destination, the second is a pointer to the source, the third is the number of bytes to be copied, and the fourth is the type, or direction, of the transfer. We typically use predefined constants in CUDA to indicate this type, as we will see in the next slide. The transfer to the device by this function is asynchronous, meaning that we can request one copy by calling cudaMemcpy, and cudaMemcpy will return right away, even before the copy is complete, so that we can immediately request another copy. This is actually very important when we begin to utilize task-level parallelism, and we will come back to this point later in the course.
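As a sketch of the call shape, using the h_A, d_A, h_C, d_C, and size names from the upcoming slide:

#include <cuda.h>

/* The shape of the two transfer directions: destination first,
   then source, then byte count, then the direction constant. */
void transferExample(float *h_A, float *d_A, float *h_C, float *d_C, int size)
{
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  /* host to device */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  /* device to host */
}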
Now that we have introduced cudaMalloc, cudaFree, and cudaMemcpy, we are ready to convert our vector addition host code into the real host code. This function is no longer just an outline; we actually have all the statements that implement those parts, except part two.

In part one, we now have declarations of d_A, d_B, and d_C. These are pointers to the objects allocated in the device memory. The first cudaMalloc call allocates the device memory for vector A. As you can see, the size calculation is n times the size of a float. Each float in CUDA is four bytes, and n is the number of elements in the vector, so this gives us the size in bytes. Then we have a cudaMemcpy call. The destination is in the device memory, which is why it is d_A; the source is in the host memory, which is why it is h_A. The size gives the number of bytes, and then there is the predefined constant cudaMemcpyHostToDevice. This constant is defined in the cuda.h file that you included in your source file. Once we have allocated the memory, we can go ahead and do the cudaMemcpy, and we can do the same thing for vector B: we allocate B and copy B from the host memory to the device memory. We also allocate memory for C, but we do not need to copy C from the host, because C is the result of the computation; the kernel is going to generate all the values of C.

Part two remains a comment here, because we are going to come back in the next lecture to complete this part. Part three then copies the result from the device memory back into the host memory, and we use the constant cudaMemcpyDeviceToHost to indicate the direction of the copy. After we are done, we can just go ahead and free A, B, and C from the device.

So this piece of code gives you all the code that you need to allocate memory and copy data in preparation for the kernel execution, and then to copy the result data back and free up the memory.
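Putting the pieces together, the completed host code might look like the following sketch; part two is still a placeholder, and error checking is omitted here, as discussed on the next slides.

#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    /* Part 1: allocate device memory and copy the inputs over. */
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);  /* no copy: the kernel produces C */

    /* Part 2: kernel launch goes here (next lecture). */

    /* Part 3: copy the result back and free the device memory. */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}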
In general, when we actually try to get performance out of this kind of code, we cannot afford to copy data back and forth before and after each kernel invocation. For real applications, we tend to keep the data resident in the device memory, and then we just keep launching kernels to perform computation on that device memory. But because this is the beginning of the course, we are showing a very simple example with all the pieces that can be involved, so that you know exactly how to allocate memory, how to copy data from host to device, and how to copy results from device to host. In a real application, some of these steps may not be necessary, because the data may already be residing in the device memory, or, in some cases, the result can stay in the device memory for future use and does not have to be copied back.

In practice, what we have shown so far is that we just go ahead and call the cudaMalloc function and assume that does the job. However, in practice I would like to encourage you to always check for error conditions. Here is what you really should do when you call cudaMalloc. You should declare a variable of the type cudaError_t. This is a predefined type in the CUDA API, also from the cuda.h file. You declare a variable, in this case called err, so that when we call cudaMalloc, we take the return value and assign it to the err variable. This error code can then be checked against cudaSuccess. Whenever the error code is cudaSuccess, it means that the function has completed what you asked for; in this case, cudaMalloc has successfully allocated the requested amount of memory and assigned the pointer to the allocated object to d_A. However, if the error code is not cudaSuccess, then you need to figure out what went wrong. In most cases, the reason there is an error condition in cudaMalloc is that there is not enough device memory to satisfy the allocation request. A good way to bring out the error message is to call the cudaGetErrorString API function. This function is also provided as part of the CUDA API, and it converts the error code into a string that is human readable. Then, just like with standard C functions, you can use __FILE__ and __LINE__ to print out the position where the error happened; this gives you the line position of the printf statement in your error check. Then you can exit the function with the EXIT_FAILURE code, so that instead of continuing to execute the code after the error has already happened, you can exit, debug your function, and see why there is not sufficient memory to satisfy the cudaMalloc request.
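Put together, the error-checking pattern might look like this sketch, applied to the first allocation of the running example:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* The error-checking pattern described above (n is illustrative). */
void checkedAllocExample(int n)
{
    int size = n * sizeof(float);
    float *d_A;

    cudaError_t err = cudaMalloc((void **) &d_A, size);
    if (err != cudaSuccess) {
        /* human-readable message, plus the file and line of this check */
        printf("%s in %s at line %d\n",
               cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }

    cudaFree(d_A);
}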
In the future slides, I am still going to be showing cudaMalloc and cudaMemcpy calls without this error-code checking, because that keeps the slides simple. But when you do the lab assignments, I would really like to encourage you to use this kind of error-checking sequence. Even though it makes your code a lot bigger, in the long run it will save you a lot of time and a lot of stress, because it will help you catch these errors, and it will be a lot easier for you to debug your program.

So, now we have completed a very quick introduction to the CUDA API functions that help you do memory allocation and data transfer, and we also showed the error-reporting string API function. If you are interested in understanding more about the CUDA API functions, please read chapter three of the textbook. Thank you.